CLN: break read_json, read_msgpack API, disallow string data input #5954

Closed
wants to merge 1 commit into from

Conversation

@ghost commented Jan 15, 2014

Revisiting #5655 in light of the recent #5874; it's just not right.
Most pandas read_* methods take a path, URL, or file-like object. The newish read_json/read_msgpack
also accept a (byte)string of data and try to guess whether the input is data or a filepath.

That creates weird corner cases and is, in my opinion, sufficiently wrong that it's worth
making a breaking change. Users will now have to wrap their data in BytesIO/StringIO.

Not much of a problem, except perhaps the lost convenience of pd.read_json(j). But since JSON
usually comes from a file or URL, both of which are supported directly, I'm hoping it won't be
that disruptive.
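A minimal sketch of the proposed calling convention, assuming pandas is importable (this is illustrative, not part of the PR's diff); the explicit StringIO wrapper replaces the implicit data-vs-path guessing, which is also the direction pandas itself eventually took when it deprecated literal string input to read_json:

```python
# Callers wrap in-memory data explicitly instead of relying on type guessing.
from io import StringIO

import pandas as pd

j = '{"a": {"0": 1, "1": 2}}'

# Ambiguous form this PR would disallow: pd.read_json(j)
# Explicit form that would remain supported:
df = pd.read_json(StringIO(j))
```

Paths and URLs would keep working unchanged; only raw data strings would need the wrapper.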

For 0.14.0, of course.

@jreback (Contributor) commented Jan 15, 2014

Tell me again why this is a problem?

@ghost (Author) commented Jan 15, 2014

Because it overloads too much. Did the filename typo yielding a JSON parse error, and the 150k
filename, not convince you that this isn't good API design?

@jreback (Contributor) commented Jan 15, 2014

I just think it's convenient...

e.g.

pd.read_json(....) rather than pd.read_json(StringIO(...))

I thought you were going to have a keyword that 'fixed' this, e.g. string= ?

@ghost (Author) commented Jan 15, 2014

I don't remember suggesting that. We've discussed this...?

@jreback (Contributor) commented Jan 15, 2014

No... I was suggesting that (or maybe pd.read_jsons).

@ghost (Author) commented Jan 15, 2014

Yeah, I think the load/loads convention is pretty familiar; that'd be preferable and
more convenient than a keyword anyway. Alternatively, reads_json.
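For reference, the stdlib convention being alluded to: `json.load` takes a file-like object and `json.loads` takes a string, with no guessing between the two.

```python
import json
from io import StringIO

s = '{"x": 1}'

assert json.loads(s) == {"x": 1}           # loads: parse a string
assert json.load(StringIO(s)) == {"x": 1}  # load: read from a file-like object

# No overloading: handing load() a plain string fails loudly
try:
    json.load(s)
except AttributeError:
    pass  # str has no .read()
```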

@jtratner (Contributor) commented

I'd vote for a string= kwarg over anything else, just because it might be clearer
for non-C people. But I don't feel strongly.

@ghost (Author) commented Jan 16, 2014

I don't follow the non-C bit.

Looking at it, read_html goes yet a third way, which is URL/data/file-like.
There's not much consistency in read_*, but I guess they have different usage profiles.

How about we remove data-string input from read_msgpack (it being binary)
and remove filename input from read_json, so it matches read_html? (So read_html(open(foo)),
or a with block.) I don't mind which way the ambiguity is resolved, just that it is.

If we move to load/loads, we'd still have to deal with read_html, which would either
have to break or be inconsistent with everything else. Not great.

Note again that neither `simplejson` nor `msgpack` nor the stdlib `pickle` has this
absurd amount of overloading.

@jreback (Contributor) commented Jan 16, 2014

Related to #5957.

Actually, json/msgpack/html are consistent NOW (csv is not, though).

Why, again, is this a problem? It does the right thing with strings / files.

What are the edge cases you are referring to? A mistyped filename gives a JSON error?

@ghost (Author) commented Jan 16, 2014

I explained the rationale several times already, going over it again isn't likely to help.
Retracted.

@ghost ghost closed this Jan 16, 2014
@ghost ghost deleted the PR_fix_read_json_msgpack branch January 16, 2014 11:26
@ghost (Author) commented Jan 16, 2014

And read_html does not accept filenames except as a URL, so they are not consistent.

@cpcloud (Member) commented Jan 16, 2014

Then that's a bug. Last time I checked, read_html could read files. IIRC there are tests for that.

@ghost (Author) commented Jan 16, 2014


def read_html(io, match='.+', flavor=None, header=None, index_col=None,
              skiprows=None, infer_types=None, attrs=None, parse_dates=False,
              tupleize_cols=False, thousands=','):
    r"""Read HTML tables into a ``list`` of ``DataFrame`` objects.

    Parameters
    ----------
    io : str or file-like
        A URL, a file-like object, or a raw string containing HTML. Note that
        lxml only accepts the http, ftp and file url protocols. If you have a
        URL that starts with ``'https'`` you might try removing the ``'s'``.

So at least the docstring suggests it's reasonable.

Et tu, Brute? sigh

@cpcloud (Member) commented Jan 16, 2014

Oh whoops, I responded too hastily; I'm not sure if read_html accepts plain filenames a la path/to/some/html.html. IIRC resolving that ambiguity wasn't worth it because you can just wrap it in a with open(...) as f: block.

@ghost (Author) commented Jan 16, 2014

Thank you for that. To me it's obvious this should change, but considering there's so much
resistance, I've let it go. Moving on.

@cpcloud (Member) commented Jan 17, 2014

In the interest of completeness, and at the risk of beating a dead horse, read_html does the following before reading in data:

  1. Checks if the string is a url
  2. Not a URL? Checks for the existence of a read attribute (e.g., StringIO and file objects)
  3. Doesn't have a read attribute? Checks if the string is an existing file.
  4. Not a file? Check if it's a raw string
  5. Not a raw string? It's a TypeError.
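The five steps above can be sketched as follows. This is a hypothetical `classify_io` helper, not pandas internals; the name and return values are illustrative:

```python
import os
from urllib.parse import urlparse


def classify_io(obj):
    """Return which branch a read_html-style dispatcher would take."""
    # 1. Does the string look like a URL?
    if isinstance(obj, str) and urlparse(obj).scheme in ("http", "https", "ftp", "file"):
        return "url"
    # 2. Does it have a read attribute (StringIO, open file)?
    if hasattr(obj, "read"):
        return "file-like"
    # 3. Is it the path of an existing file?
    if isinstance(obj, str) and os.path.isfile(obj):
        return "path"
    # 4. Otherwise any string is treated as raw HTML data.
    if isinstance(obj, str):
        return "raw string"
    # 5. Anything else is a TypeError.
    raise TypeError(f"cannot read object of type {type(obj).__name__!r}")
```

Note that step 3 depends on the file actually existing, so a mistyped path silently falls through to the raw-string branch in step 4, which is exactly the filename-typo ambiguity raised earlier in the thread.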

@cpcloud (Member) commented Jan 17, 2014

Of course, if it's a non-existent path then it will be treated as a raw string... which is the issue.

This pull request was closed.
3 participants